|
Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of ''k'' items from a list ''S'' containing ''n'' items, where ''n'' is either a very large or unknown number. Typically ''n'' is large enough that the list doesn't fit into main memory. == Algorithm R == The most common example was labelled ''Algorithm R'' by Jeffrey Vitter in his paper on the subject. This simple O(''n'') algorithm as described in the ''Dictionary of Algorithms and Data Structures''〔( Dictionary of Algorithms and Data Structures )〕 consists of the following steps (assuming that the arrays are one-based, and that the number of items to select, ''k'', is smaller than the size of the source array, ''S''): The algorithm creates a "reservoir" array of size ''k'' and populates it with the first ''k'' items of ''S''. It then iterates through the remaining elements of ''S'' until ''S'' is exhausted. At the ''i''th element of ''S'', the algorithm generates a random number ''j'' between 1 and ''i''. If ''j'' is less than or equal to ''k'', the ''j''th element of the reservoir array is replaced with the ''i''th element of ''S''. In effect, for all ''i'', the ''i''th element of ''S'' is chosen to be included in the reservoir with probability ''k/i''. Similarly, at each iteration the ''j''th element of the reservoir array is chosen to be replaced with probability ''1/k * k/i'', which simplifies to ''1/i''. It can be shown that when the algorithm has finished executing, each item in ''S'' has equal probability (i.e. ''k/length(S)'') of being chosen for the reservoir. To see this, consider the following proof by induction. After the (i-1)th round, let us assume, the probability of a number being in the reservoir array is ''k/(i-1)''. Since the probability of the number being replaced in the ithround is ''1/i'', the probability that it survives the ith round is ''(i-1)/i''. Thus, the probability that a given number is in the reservoir after the ith round is the product of these two probabilities, i.e. the probability of being in the reservoir after the (i-1)th round, and surviving replacement in the ith round. This is ''(k/(i-1)) * ((i-1)/i)=k/i''. Hence, the result holds for ''i'', and is therefore true by induction. 抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「Reservoir sampling」の詳細全文を読む スポンサード リンク
|